Exploring Data Science Education: From Tutorials to Assessment

Duke Statistical Science | Graduation with Distinction

Evan Dragich
supervised by Mine Çetinkaya-Rundel, PhD.

April 11, 2023

Introduction / About Me

  • Basic demographic background information
  • How I found myself in Duke StatSci (leads into)
  • How I found myself working on this thesis

Agenda/Thesis TOC

  • Explanation of the two strands of the thesis
  • Summary of the following slides (plus logistics, e.g., that there will be time for questions)

Building a Data Science Assessment

Background

  • Colloquially motivate the need for a DS concept inventory
    • CAOS has 418 citations!
  • Data science as an emerging field: what is it, exactly?
    • Colloquially explain the Çetinkaya-Rundel/Ellison and Zhang/Zhang findings
  • How exactly do people (1) make, (2) pilot, and (3) validate new concept inventories or scales?
    • Colloquially explain the Jorion papers

Initial cleaning / getting my feet wet

  • Should I even present on this? Is any of it interesting or worth diving into?

Interviews

  • Reminder of what we tried to do: three faculty interviews to check whether the scope was appropriate from an instructor perspective, and three TA interviews to check whether the questions were landing with a population closer to the target, but still with some DS context
  • Summarize results and themes from the faculty interviews (mainly that you could clearly tell the CS professor from the two statistics professors by what they thought should be included, plus specific flags such as uncomfortable contexts (storm paths?))

Current Prototype

  • 15 passages, 26 items
Passage                  Learning Objective(s)
Storm Paths              modeling; simulation; uncertainty
Movie Budgets 1          compare summary statistics visually
Movie Budgets 2          modeling; \(R^2\); compare trends visually
Application Screening    ethics; modeling; proxy variable
Banana Conclusions       causation; statistical communication
COVID Map                complex visualization; spatial data; time series; sophisticated scales
He Said She Said         basic visualization; sophisticated scales
Build-a-Plot             data to visualization process
Disease Screening        compare classification diagnostics visually
Realty Tree              modeling; regression tree; variable selection
Website Testing          compare trends visually; uncertainty; modeling; time series; extrapolation
Image Recognition        ethics; modeling; representativeness of training data
Data Confidentiality     ethics; data deidentification; statistical communication
Activity Journal         structure data; store data
Movie Wrangling          data cleaning; data wrangling; column-wise string operations; pseudocode; joins

Case Study: Application Screening

Start with Application Screening: a question based on a proxy variable.

You are working on a team that is making a deterministic model to quickly screen through applications for a new position at the company. Based on employment laws, your model may not include variables such as age, race, and gender, which could be potentially discriminatory.

Your colleague suggests including a rule that eliminates candidates with more than 20 years of previous work experience, because they may have high salary expectations. Why might using this variable be considered unethical? Explain your answer.

Oops: best practice would phrase this in a non-leading way. If a student wasn't initially going to think this was unethical, but we told them it might be, their explanation won't be as valuable as that of someone who would have flagged the issue right away. Okay, let's rephrase so they must answer whether it is or isn't:

Case Study: Application Screening

You are working on a team that is making a deterministic model to quickly screen through applications for a new position at the company. Based on employment laws, your model may not include variables such as age, race, and gender, which could be potentially discriminatory.

Your colleague suggests including a rule that eliminates candidates with more than 20 years of previous work experience, because they may have high salary expectations. Are there ethical implications of using this variable to select candidates? Explain your answer.

Well… that doesn’t help much. We still have the classic selection bias clouding results; students would think “well, if there wasn’t an ethical problem, they wouldn’t have included this as one of the only ethics questions on the assessment.”

Plus, how are we grading this? What are we looking for to confirm that they understand the proxy variable? It might work to set up an autograder that marks a response “correct” if it answers “yes” to the ethics question AND mentions “proxy.” But this is an introductory-level assessment. Will students be able to concisely describe employment experience as a “proxy,” or will explanations be wordier, with phrasings like “is correlated with,” “is related to,” “goes hand in hand with,” or “predicts”? If we want autogradability, something like multiple choice is the main way to go. Note that, at this stage, Application Screening and several other similar questions are left in open-ended format to collect this type of data from the pilots.
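
The keyword-based autograder described above could be sketched roughly like this (a base R sketch under the assumption that responses arrive as a yes/no flag plus free text; the function name and the synonym list are illustrative, not part of the actual assessment):

```{r proxy-autograder}
# Hypothetical keyword autograder: mark a response correct only if the
# student answered "yes" AND the explanation contains "proxy" or one of
# its wordier stand-ins.
proxy_synonyms <- c("proxy", "correlated with", "related to",
                    "goes hand in hand with", "predicts")

grade_response <- function(answered_yes, explanation) {
  mentions_proxy <- any(sapply(
    proxy_synonyms,
    function(p) grepl(p, tolower(explanation), fixed = TRUE)
  ))
  answered_yes && mentions_proxy
}

grade_response(TRUE,  "Experience is a proxy for age.")       # TRUE
grade_response(TRUE,  "It just seems unfair to older people") # FALSE
grade_response(FALSE, "Experience is a proxy for age.")       # FALSE
```

The second call is exactly the failure mode discussed above: a student who senses the problem but doesn't use any anticipated phrasing gets marked wrong, which is why the open-ended pilot data matters.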

Case Study: Data Confidentiality

So, after that whole mess, a general conclusion is that multiple choice might be the only way to go. How do we write a good MC ethics question? Here’s a start, with the focus on identifiable data and statistical communication:

A newspaper reports on the results of a survey from a small (<2000 student) college. The college agrees to have the data released to the public so long as the students’ identities and academic standing information are kept confidential. Which of the following combinations of variables is less likely to unintentionally identify any students? Explain.

a. Year, major, sports played

b. Year, major

Well, first of all, is “college” even the best word here? While it’s roughly synonymous with “university” in the US, the two can have very different meanings from country to country. So let’s eliminate any ambiguity right off the bat.

Case Study: Data Confidentiality

A newspaper reports on the results of a survey from a small (<2000 student) university. The university agrees to have the data released to the public so long as the students’ identities and academic standing information are kept confidential. Which of the following combinations of variables is less likely to unintentionally identify any students? Explain.

a. Year, major, sports played

b. Year, major

Great! There is no issue with grading this at scale, as students will simply choose option “a” to be marked correct. But how valid is this binary comparison of two nearly identical options in measuring students’ understanding of data privacy (and of key variables whose intersections can quickly narrow down a population)? Would they choose “a” simply because of a “presence vs. absence” heuristic, similar to the selection bias issue addressed earlier, or because they understand how the extra variable quickly narrows down who a respondent could be? This question took a lot of brainstorming and workshopping, and we ultimately landed on the following options:

Case Study: Data Confidentiality

A newspaper reports on the results of a survey from a small (<2000 student) university. The university agrees to have the data released to the public so long as the students’ identities and academic standing information are kept confidential. Which of the following combinations of variables is least likely to unintentionally identify any students? Explain.

a. Class year and sports played

b. Student ID and dorm zip code

c. GPA and major

d. Birth date and phone number

e. None of the above
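
The intuition behind these options, that intersections of a few innocuous variables can single students out, can be illustrated with a quick simulation (a dplyr sketch; the enrollment size, variable names, and level counts below are all made up for illustration):

```{r uniqueness-sketch}
library(dplyr)

set.seed(42)

# Simulate a small (<2000 student) university
students <- tibble(
  year  = sample(c("First-year", "Sophomore", "Junior", "Senior"),
                 1800, replace = TRUE),
  major = sample(paste("Major", 1:30), 1800, replace = TRUE),
  sport = sample(c("None", paste("Sport", 1:15)), 1800, replace = TRUE,
                 prob = c(0.6, rep(0.4 / 15, 15)))
)

# How many students are uniquely identified by a given combination
# of variables?
n_unique <- function(df, ...) {
  df |>
    count(...) |>
    filter(n == 1) |>
    nrow()
}

n_unique(students, year, major)         # very few unique students
n_unique(students, year, major, sport)  # many more are singled out
```

Adding one more variable multiplies the number of possible cells, so far more students end up alone in theirs, which is exactly the re-identification risk the question is probing.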

Case Study: Movie Budgets 1

A data scientist at IMDb has been given a dataset of the revenues and budgets for 2,349 movies made between 1986 and 2016.

Suppose they want to compare several distributional features of the budgets among four different genres—Horror, Drama, Action, and Animation. To do this, they create the following plots.

Fill in the following table by placing a check mark in the cells corresponding to the attributes of the data that can be determined by examining each of the plots.

          Plot A    Plot B    Plot C    Plot D
Mean
Median
IQR
Shape

Assessment Next Steps

  • 199 Pilot
  • IRB Roadblocks
  • NSF Grant?
    • Turn the assessment into a more robust JavaScript framework, as CAOS is

Working on the dsbox package

dsbox package

  • Reference the growing interest in DS and the scalability of DS education from the assessment discussion earlier; that, plus the open-source nature of R, lends itself naturally to making a standardized curriculum

  • What is Data Science in a Box? It’s exactly that: a standardized curriculum that uses the tidyverse to practice basic data wrangling, visualization, and modeling.

  • That curriculum set was then condensed into a package for self-learners called dsbox, which users can download and follow to become well-acquainted with basic data science in R.

How does it work?

  • 2 key packages: learnr and gradethis.

  • learnr provides a robust framework for turning R Markdown documents into interactive tutorials, in which users are guided through running and writing code, answering quiz questions, watching videos, etc., directly in the “Tutorial” pane in RStudio. A key feature is that progress is saved, so users can resume working whenever they like.

  • gradethis builds on that basic, broad framework and provides tools for drilling down deeper when grading. Instructors can give targeted feedback for a variety of common mistakes using sophisticated testing logic.
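
For reference, a learnr tutorial is an ordinary R Markdown document with a special output format and runtime; a minimal skeleton looks something like this (the title and chunk labels are illustrative, not taken from dsbox):

````
---
title: "A Minimal Tutorial"
output: learnr::tutorial
runtime: shiny_prerendered
---

```{r setup, include = FALSE}
library(learnr)
```

## A first exercise

Fill in the blank to add 2 and 2:

```{r add-two, exercise = TRUE}
2 + ___
```

```{r add-two-solution}
2 + 2
```
````

The `exercise = TRUE` chunk option turns a chunk into an editable, runnable code box, and companion chunks are matched to it by the label suffixes (`-hint-1`, `-solution`, `-check`, and so on).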

Creating a Tutorial

  • Nine tutorials existed, plus a skeleton for one more (these corresponded to all homework assignments from Data Science in a Box; the package already had skeleton .Rd files for the dataset).

  • I decided to try to recreate that tutorial, adding some flair and ideas based on best practices

    • Scaffolded one of the exercises more heavily to match the “interactive tutorial” modality rather than the “take-home class homework assignment” one

Sample Tutorial: Home Page

Sample Tutorial: Code chunk with hint

Sample Tutorial: Opening the hood

```{r common-themes, exercise = TRUE}
lego_sales |>
  ___(___)
```
```{r common-themes-hint-1}
# Look at the previous question for help!
```
```{r common-themes-solution}
lego_sales |>
  count(theme, sort = TRUE)
```

Sample Tutorial: Opening the hood

```{r common-themes-check}
grade_this({
  # pass() and fail() stop evaluation, so these checks run in order
  if (identical(as.character(.result[1, 1]), "Star Wars")) {
    pass("You have counted themes and sorted the counts correctly.")
  }
  if (identical(as.character(.result[1, 1]), "Advanced Models")) {
    fail("Did you forget to sort the counts in descending order?")
  }
  if (identical(as.character(.result[1, 1]), "Classic")) {
    fail("Did you accidentally sort the counts in ascending order?")
  }
  if (identical(as.character(.result[1, 1]), "Adventure Camp")) {
    fail("Did you count subthemes instead of themes?")
  }
  if (identical(as.numeric(.result[1, 2]), 172)) {
    fail("Did you count subthemes instead of themes?")
  }
  fail("Not quite. Take a peek at the hint!")
})
```

Releasing to CRAN

  • Explaining what CRAN is

  • Explaining what the DESCRIPTION file and dependencies are

  • Unfortunately, gradethis is still in development and thus not yet released on CRAN itself. As a result, we are unable to upload a package that specifies a non-CRAN package as one of its dependencies. We have submitted an issue.
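
For illustration, during development a non-CRAN dependency can be declared with a Remotes field, which installers like remotes and devtools understand but which CRAN does not accept. A sketch of such a DESCRIPTION (the field values below are made up, not dsbox’s actual metadata):

```
Package: examplepkg
Title: An Example Package With a Non-CRAN Dependency
Version: 0.1.0
Imports:
    learnr,
    gradethis
Remotes:
    rstudio/gradethis
```

This works fine for GitHub-based installs, but a CRAN submission requires every package in Imports to be available on CRAN (or a declared additional repository), hence the roadblock.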

Discussion

Learning Takeaways

  • Learned advanced computing skills I wouldn’t have picked up otherwise in my abbreviated trip through the Stat major

  • Learned how to interact with others’ code beyond the scope of a classroom or research team (making and reviewing public pull requests; compiling, standardizing, and revising the assessment and package)

Reflections

  • The saying that “teaching material is the only way to master it” had always rung true in my tutoring and TAing experiences; developing and studying a curriculum to the point of scrutiny took my understanding to the next level

  • Newfound appreciation for the work that has gone into today’s educational curriculum materials, for tools like ghclass and learnr, and for every package that has been released and maintained, and what it takes to do that

  • Inspired me to continue interacting with the world of open-source software, even though my job (for the time being) is to be an Excel monkey for 40 hours a week

Q&A